Search CORE

Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages

Author: Cohen Albert
Drach Nathalie
Drebes Andi
Heydemann Karine
Pop Antoniu
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/10/2014
Field of study

International audienceWe present a joint scheduling and memory allocation algorithm for efficient execution of task-parallel programs on non-uniform memory architecture (NUMA) systems. Task and data placement decisions are based on a static description of the memory hierarchy and on runtime information about intertask communication. Existing locality-aware scheduling strategies for fine-grained tasks have strong limitations: they are specific to some class of machines or applications, they do not handle task dependences, they require manual program annotations, or they rely on fragile profiling schemes. By contrast, our solution makes no assumption on the structure of programs or on the layout of data in memory. Experimental results, based on the OpenStream language, show that locality of accesses to main memory of scientific applications can be increased significantly on a 64-core machine, resulting in a speedup of up to 1.63× compared to a state-of-the-art work-stealing scheduler

The University of Manchester - Institutional Repository

Scalable Task Parallelism for NUMA: A Uniform Abstraction for Coordinated Scheduling and Memory Management

Author: Cohen Albert
Drach Nathalie
Drebes Andi
Heydemann Karine
Pop Antoniu
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 11/09/2016
Field of study

International audienceDynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing and locality. Yet these promises are undermined by non-uniform memory access (NUMA). We show that using NUMA-aware task and data placement, it is possible to preserve the uniform abstraction of both computing and memory resources for task-parallel programming models while achieving high data locality. Our data placement scheme guarantees that all accesses to task output data target the local memory of the accessing core. The complementary task placement heuristic improves the locality of task input data on a best effort basis. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability by eliminating false dependences and enabling fine-grained dynamic control over data placement. The algorithms are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences readily available in the run-time system and placement information from the operating system. We achieve 94% of local memory accesses on a 192-core system with 24 NUMA nodes, up to 5× higher performance than NUMA-aware hierarchical work-stealing, and even 5.6× compared to static interleaved allocation. Finally, we show that state-of-the-art dynamic page migration by the operating system cannot catch up with frequent affinity changes between cores and data and thus fails to accelerate task-parallel applications

Crossref

The University of Manchester - Institutional Repository

Direct-mapped versus : set-associative pipelined caches

Author: Drach Nathalie
Seznec André
Windheiser Daniel
Publication venue: HAL CCSD
Publication date: 01/01/1994
Field of study

Disponible dans les fichiers attachés à ce documen

Architecture des ordinateurs

Author: Daumas Marc
Drach Nathalie
Publication venue: Hermes Science Publications
Publication date: 01/01/2010
Field of study

National audienc

Etude de quelques organisations d'antémémoires

Author: Drach Nathalie
Seznec André
Publication venue: HAL CCSD
Publication date: 01/01/1992
Field of study

Les performances des systèmes batis autour d'un microprocesseur dépendent de plus en plus des performances de la hiérarchie mémoire, et plus particulièrement des antémémoires. En effet depuis ces dernières années, le temps de cycle du processeur a diminué beaucoup plus vite que le temps d'accès à la mémoire principale. Cette tendance a augmenté l'importance des antémémoires et de leur efficacité. Dans ce rapport, nous présentons trois nouvelles organisations d'antememoires situees sur la meme puce que le processeur et qui maximisent les performances des microprocesseurs. La premiere organisation est dérivée de l'organisation d'antémémoire unifiée à laquelle on a ajouté un tampon instructions. La deuxieme organisation est composée d'une antémémoire unifiée instructions et données et d'une antémémoire séparée instructions. Enfin nous présentons l'organisation d'antémémoires semi-unifiées. Cette organisation est composée de deux antémémoires, C1 et C2, séparées physiquement, mais destinées à recevoir instructions et données. L'antémémoire C1 est à la fois antémémoire principale pour les instructions et antémémoire secondaire pour les données (C2 est antémémoire principale pour les données et antémémoire secondaire pour les instructions). Ainsi le degré d'associativité pour les données et les instructions est artificiellement augmenté et l'espace de stockage se répartit dynamiquement entre instructions et données. Les antémémoires semi-unifiées diminuent le taux de défaut d'antémémoires par rapport aux organisations d'antémémoires couramment utilisées dans les microprocesseurs

MODEE : smoothing branch and instruction cache miss penalties on deep pipelines

Author: Drach Nathalie
Seznec André
Publication venue: HAL CCSD
Publication date: 01/01/1993
Field of study

Pipelining is a major technique used in high performance processors. But a fundamental drawback of pipeling is the lost time due to branch instructions. A new organization for implementing branch instructions is presented : the Multiple Instruction Decode Effective Execution (MIDEE) organization. All the pipeline depths may be addressed using this organization. MIDEE is based on the use of double fetch and decode, early computation of the target address for branch instructions and two instruction queues. The double fetch-decode concerns a pair of instructions stored at consecutive addresses. These instructions are then decoded simultaneously, but no execution hardware is duplicated,only useful instructions are effectively executed. A pair of instruction queues are used between the fetch-decode stages and execution stages, this allows to hide branch penalty and most of the instruction cache misses penalty. Trace driven simulations show that the performance of deep pipeline processor may dramatically be improved when the MIDEE organization is implemented : branch penalty is reduced and pipeline stall delay due to instruction cache misses is also decreased

Semi-inified caches

Author: Drach Nathalie
Seznec André
Publication venue: HAL CCSD
Publication date: 01/01/1993
Field of study

Since the gap between main memory access time and processor cycle time is continuously increasing, processor performance dramatically depends on the behavior of caches and particularly on the behavior of small on-chip caches. In this paper, we present a new organisation for on-chip caches : the semi-unified cache organization. In most microprocessors, two physically split caches are used for respectively storing data and instructions. The purpose of the semi-unified cache organization is to use the data cache (resp. instruction cache) as an on-chip second-level cache for instructions (resp. data). Thus the associativity degree of both on-chip caches is artificially increased and the cache spaces respectively devoted to instructions and data are dynamically adjusted. The off-chip miss ratio of a semi-unified cache built with two direct-mapped caches of size S is equal to the miss ratio of a unified two-way set associative cache of size 2S ; yet, the hit time of this semi-unified cache is equal to the hit time of a direct-mapped cache ; moreover both instructions and data may be accessed in parallel as for the split data/instruction cache organization. Since on-chip miss penalty is lower than off-chip miss penalty, trace driven simulations show that using a direct-mapped semi-unified cache organization leads to higher overall system performance than using usual split instruction/data cache organization